One unfortunate characteristic of most highly visible educational accountability systems is their close 
tie to a single or very few consequential levels of academic achievement. For example, the Adequate 
Yearly Progress provision of the No Child Left Behind Act of 2001 focuses exclusively on a 
“proficiency” level of achievement. Since attainment of “proficiency” is the sole level for which 
credit is granted in this system, the concern is that students making good progress but not enough to be 
considered “proficient” receive no credit toward an index of being “accountable”. Over time, students 
who are considered too far from the “proficiency” level to attain it by assessment time run 
the risk of losing instructional attention in favor of more “proficient probable” students. Similarly, 
students who are obviously beyond the key “proficiency” level can also lose instructional attention. 
Any positive change in their status will not affect the accountability index. 

Various authors have addressed this dilemma. For example, Linn, Baker, and Betebenner (2002) 
proposed a system that assigns fractional credit to performance categories other than “proficient”. 
Flicek and Lowham (2001) proposed using individual student growth referenced to longitudinal 
growth norms as a method of incorporating and giving credit for progress made, even though the end 
performance status might fall short of a performance criterion. Kingsbury (2000) proposed a “hybrid 
success model” for setting individual student growth expectations for students based on their 
proximity to an achievement target, thus allowing both status and growth to demonstrate 
accountability. What distinguishes the Flicek and Lowham and the Kingsbury proposals is the central 
role each assigns to individual student growth within an accountability scheme. Both consider 
individual growth as an integral part of accountability, not merely as an optional supplement (or 
worse, an interesting side note) to performance status. 

This paper is predicated on a rather simple argument: in order for academic growth to serve in a 
fundamental role in an accountability system, the amount of growth a student would reasonably be 
expected to attain over some set time interval (i.e., a growth expectation, standard, or target) must be 
able to be declared in advance. These declarations, or others based on them, will typically be 
translated into a form of value within an accountability scheme. This value, in turn, will be at least 
part of the evidence for judging the extent to which the school (or district or state) is being successful 
or ‘accountable’. For example, a district expectation might be that all 4th grade students grow by X 
amount. While this expectation is certainly convenient, its reasonableness is open to question. Is it 
reasonable to assume that all students in a single grade would grow at the same rate? For a high 
achieving student, requiring average grade level growth will likely be more demanding than requiring 
average grade level growth from a lower achieving student. A lower achieving student might even be 
thought of as “under-challenged”. Neither student would be treated equitably. 

The reasonableness of a growth target can be thought of in terms of the proximity between the observed growth and the 
expected growth; the closer the observed growth is to expected growth, the more reasonable the 
growth target. This position implies that observed growth that is substantially greater than the target is 
no more or less reasonable than observed growth that is substantially less than the target. With a focus 
on individual student growth, it should be possible to create a method of defining reasonable, equitable 
growth targets for each student using characteristics of the individual student’s past performance. 

There is already strong evidence, for example, that the rate of growth is often associated with initial 
student achievement status (e.g., NWEA, 2002; Seltzer, Choi, & Thum, 2002a, 2002b). 

The purpose of this study was to evaluate several feasible models for determining single-year 
academic growth targets for individual students. These models are detailed in the next section. 
Single-year growth targets were considered as the most likely points from which declarations of the 
value of observed growth would be defined for use in an accountability system (e.g., “value added” 
systems). This study was undertaken as an initial, empirical exploration of some of the territory 
involved in this area. The study is certainly not definitive, though it holds implications for questions 
such as: ‘How much academic growth can we reasonably expect a student to make over the course of a 
year?’; ‘Is it reasonable to ask all students in the same grade to grow at the same rate?’; ‘Can the 
observed growth of large numbers of students who were in the same grade level and in the same 
achievement range, help to define reasonable growth?’. The study does not address how growth data, 
per se, should be used in an accountability system, only how an equitable baseline of growth could 
be established. 

Methods 



Data sources. 

Data for the study came from three cohorts of student test records. Two of the cohorts (A and B) were 
from a single moderate sized school district in Wyoming. The district has 28 elementary and four 
middle schools and a total student population of slightly over 12,000. For these cohorts there were 
four waves each of spring achievement data in reading and mathematics (spring 1999 through spring 
2002). The third cohort (C) came from the Northwest Evaluation Association 2002 RIT Scale Norms 
Study. The test records making up this cohort are from students in nine districts in six states. In this 
set there were 10 waves of fall and spring achievement data in reading and mathematics (fall 1996 
through spring 2001). For all three cohorts, the last wave contained the scores to be predicted. In 
Cohort C, however, the ninth wave (fall 2000) was not treated as an observed wave and was not used 
for prediction. In all cohort datasets, only those student records 
containing complete test data for a subject area were included in the analyses for that area. Thus, for 
example, a particular student’s complete reading test data would be included even though their 
mathematics test data were incomplete (and not included). These cohort characteristics are 
summarized in Table 1. 



Table 1. Characteristics of the cohort data sets 



Cohort   Districts     Total   "Observed" waves used      Predicted      Reading   Math
         represented   waves   for prediction             term, year     N         N
                               (term, year) (a)           / grade
A        1             4       S99, S00, S01              S02 / 5        655       659
B        1             4       S99, S00, S01              S02 / 6        738       742
C        9             10      F96, F97, F98, F99,        S01 / 8        3876      4132
                               S97, S98, S99, S00

(a) F = fall; S = spring 

Table 2 presents achievement data for the three cohorts. Achievement levels between the cohorts were 
comparable in common grades for the spring terms. Variance in common grades in Reading tended to 
be slightly higher in Cohorts A and B than for Cohort C. The reverse was true in Mathematics. In 
Mathematics for Cohort C, a trend of increasing variance from the first wave to the last was observed. 



Table 2. Descriptive statistics of cohort performance in reading and mathematics by season and year. 



             Reading                                   Mathematics
Season-
Year    Grd  Med.  Mean    SD      Min.  Max.  N       Med.  Mean    SD      Min.  Max.  N

Cohort A
S-99    2    192   188.7   14.72   148   226   655     190   189.0   12.54   144   226   659
S-00    3    202   199.0   13.81   152   234   655     203   201.6   12.42   148   239   659
S-01    4    210   207.3   14.03   144   239   655     213   212.2   11.07   159   250   659
S-02*   5    215   213.9   12.78   164   252   655     222   221.9   12.15   174   264   659

Cohort B
S-99    3    201   198.5   13.83   148   232   738     202   200.5   11.57   161   229   742
S-00    4    209   206.4   12.77   152   237   738     211   210.9   11.58   168   255   742
S-01    5    215   213.7   12.95   154   247   738     220   220.2   12.45   177   254   742
S-02*   6    220   219.1   12.40   155   247   738     228   226.8   12.73   180   262   742

Cohort C
F-96    4    204   201.7   13.49   143   233   3876    201   200.4   11.30   149   247   4132
S-97    4    210   208.3   13.20   143   243   3876    210   209.4   12.13   154   255   4132
F-97    5    211   209.1   12.96   147   241   3876    210   209.5   12.42   155   252   4132
S-98    5    216   214.6   12.67   154   251   3876    218   218.0   12.97   150   263   4132
F-98    6    217   215.5   12.12   155   250   3876    218   217.2   13.10   172   261   4132
S-99    6    222   220.1   11.97   156   258   3876    225   225.5   14.88   172   278   4132
F-99    7    222   220.1   11.68   156   256   3876    226   226.2   15.25   160   282   4132
S-00    7    225   223.5   12.37   166   261   3876    234   233.9   16.35   171   293   4132
F-00    8    225   224.3   11.95   160   268   3876    235   234.5   16.56   164   290   4132
S-01*   8    230   229.1   11.60   165   269   3876    242   242.1   17.10   184   294   4132

* Designates the season-year for which scores were predicted. 

Test characteristics. All tests used in this study were created from the NWEA item banks in Reading 
and Mathematics. These banks consist of several thousand test items that have been calibrated 
for difficulty using the one-parameter Item Response Theory (IRT) model (Rasch model). Item 
difficulty and student ability are both expressed in Rasch Units (RITs) on the same scale. A RIT is 
simply the linear transformation of the logit theta metric that sets the unit at .10 logits and centers the 
scale at 200 (i.e., $RIT = \theta \times 10 + 200$). Thus, a RIT of 210 is equivalent to logit = 1. There is one 
scale for Reading and one scale for Mathematics. Paper and pencil Achievement Level Tests in 
Reading can measure dependably from about RIT 149, ±3.6 (percentile 2 in fall grade 2) to about RIT 
252, ±5.1 (percentile 98 in spring grade 10). In Mathematics, paper and pencil tests measure 
accurately from about RIT 156, ±3.8 (percentile 2, fall grade 2) to about RIT 276, ±5.5 (> percentile 
98 in spring grade 10). Well-targeted level tests typically have measurement error in the 2.8 - 3.3 
range. Computerized-adaptive versions extend slightly the measurement ranges with these levels of 
associated measurement error. A complete description of the technical characteristics of NWEA tests 
can be found in the NWEA Technical Manual for Achievement Level Tests and Measures of Academic 
Progress (2003). 
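
The scale transformation described above can be sketched in a few lines. This is only an illustration of the stated formula; the function names are hypothetical and are not taken from any NWEA software.

```python
def logit_to_rit(theta: float) -> float:
    """Convert a Rasch logit value to the RIT scale (RIT = theta * 10 + 200)."""
    return theta * 10 + 200


def rit_to_logit(rit: float) -> float:
    """Inverse transformation: RIT score back to the logit metric."""
    return (rit - 200) / 10


print(logit_to_rit(1.0))  # 210.0
print(rit_to_logit(210))  # 1.0
```

Because the transformation is linear, differences in RITs (i.e., growth) are simply ten times the corresponding differences in logits.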

NWEA RIT Scale Norms. Several of the models used to determine individual student growth targets 
used data reported in the NWEA 2002 norms study. This study includes the test records of 
approximately 1.05 million students representing 321 school districts in 24 states. The districts ranged 
from very urban to very rural. They ranged in size from under 200 to over 60,000 students. 

The norms study provided several specific data elements. Grade level means and standard deviations 
of student status and growth in the grades of interest were used. For status level data, these were 
based on roughly 71,000 to 89,000 students per grade level. Grade level growth means were based on 
intact groups of students; that is, student growth was based on the same students having both scores 
used to calculate a change (growth) score. Spring-to-spring grade level growth means were based on 
roughly 44,000 to 54,000 students per grade level. Growth means were also retrieved that were 
disaggregated by the starting status level of students. These means were calculated for all students 
whose achievement status at the beginning of the comparison period fell into each 10 point RIT block. 
RIT blocks were set at 140-149, 150-159, 160-169, ..., 260-269. The numbers of students used 
to compute the means in these RIT block cells ranged from 258 to over 14,000. Average N’s for all 
RIT block cells were 4427 for Reading and 4495 for Mathematics. Spring-to-spring growth 
distributions are summarized in Tables 3a and 3b for Reading and in Tables 4a and 4b for Mathematics. 

Table 3a. Means and standard deviations of spring-to-spring achievement growth in 
Reading by grade level and initial RIT block 

Table 4a. Means and standard deviations of spring-to-spring achievement growth in 
Mathematics by grade level and initial RIT block 

Models for determining individual student growth targets. 

All the models investigated yielded a prediction of each student’s final term RIT score in the subject 
area being considered. For Cohorts A and B, this was spring 2002; for Cohort C it was spring 2001. 
Individual student prediction residuals (observed score - predicted score) were used as the basis for 
comparing the models. The models differed in the way the available data (prior to the final term) were 
treated and combined with a growth estimate to arrive at a prediction. Models based on mean z-score 
status were the only models not to include an explicit estimate of growth. Some models used growth 
norm references from the 2002 NWEA norming study. One model used no prior achievement data but 
only the mean observed growth of same grade-level students from the norms study. A second model 
used only the observed RIT score from the spring prior to the final (predicted) term and the mean 
observed growth of students who achieved a similar RIT score at the same grade level from the norms 
study. All other models used all prior RIT scores from a student’s record to arrive at a growth 
estimate for the student. Some used the scores directly while others relied on modeling these scores to 
“true” score estimates using linear modeling (LM). Stated more formally, the models are as follows: 

Mean grade level growth (MGLG): 

$$\hat{Y}_{g+1} = RIT_{gi} + \mu_{g}$$

Where $RIT_{gi}$ is the observed RIT score for student $i$ in grade $g$, the final observed grade; $\mu_{g}$ is 
the mean growth of students in the norms study going from grade $g$ to $g+1$. 

Mean RIT block growth (MRBG): 

$$\hat{Y}_{g+1} = RIT_{gi} + \mu_{RB.g}$$

Where $RIT_{gi}$ is the observed RIT score for student $i$ in grade $g$, the final observed grade; $\mu_{RB.g}$ 
is the mean RIT block growth of students in the norms study going from $g$ to $g+1$ whose 
achievement in the final observed grade, $g$, was in RIT block $RB$. 

Linear Model (LM) least squares slope estimate (LMlsSlp): 

$$\hat{Y}_{g+1} = RIT_{gi} + \pi_{1i.LS}$$

Where $RIT_{gi}$ is the observed RIT score for student $i$ in grade $g$, the final observed grade; $\pi_{1i.LS}$ 
is the LM least squares estimate of growth rate for student $i$ over the entire data collection 
period. 
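
As a minimal sketch of how the two norms-referenced models (MGLG and MRBG) produce a prediction, consider the following. The growth values in the lookup tables are hypothetical placeholders chosen for illustration; they are not the values reported in the norms study, and the function names are assumptions.

```python
# Hypothetical growth norms for illustration only (NOT values from the norms study).
GRADE_GROWTH = {4: 7.0}  # mean growth for all students, grade 4 -> grade 5
BLOCK_GROWTH = {         # mean growth keyed by (grade, lower bound of RIT block)
    (4, 190): 9.0,
    (4, 200): 8.0,
    (4, 210): 7.0,
    (4, 220): 6.0,
}


def rit_block(rit: float) -> int:
    """Lower bound of the 10-point RIT block containing rit (e.g., 207 -> 200)."""
    return int(rit // 10) * 10


def predict_mglg(rit_g: float, grade: int) -> float:
    """MGLG: last observed RIT plus mean grade level growth."""
    return rit_g + GRADE_GROWTH[grade]


def predict_mrbg(rit_g: float, grade: int) -> float:
    """MRBG: last observed RIT plus mean growth for the student's RIT block."""
    return rit_g + BLOCK_GROWTH[(grade, rit_block(rit_g))]


print(predict_mglg(207, 4))  # 214.0
print(predict_mrbg(207, 4))  # 215.0
```

The only difference between the two models is the growth term: MGLG applies one value to every student in the grade, while MRBG conditions the value on the student's starting status.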



In the LMlsSlp model and all the models below that include a linear model (LM) component, the 
linear model component developed was equivalent to the level 1 model of a hierarchical linear model 
(HLM). The level 1 model was structured as it might be posed in a study of academic achievement 
growth in a school system; that is, without predictor variables and using grade level as the time 
variable. In contrast to a growth study, however, the time variable, grade, was ‘re-centered’ on the last 
observed grade so that it took on a value of 0 while prior grades took on negative values. For example, 
in Cohort A, grades 2, 3, and 4 became grades -2, -1, and 0, respectively, and were used to predict grade 5, which 
took on the value of +1. When fall scores were included in the analyses (Cohort C), the decimals .1 
and .8 were used to distinguish between fall and spring, respectively. Centering on the final 
‘observed’ grade (7.8) resulted in the grades 4.1, 4.8, 5.1, 5.8, 6.1, 6.8, 7.1, and 7.8 being converted to 
-3.7, -3.0, -2.7, -2.0, -1.7, -1.0, -.7, and 0, respectively. All models that included linear components 
were estimated using HLM5 (Raudenbush, Bryk, Cheong & Congdon, 2001). 
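
The re-centering and its effect on the least squares estimates can be illustrated with a short sketch. It assumes a plain per-student ordinary least squares fit rather than the HLM5 estimation used in the study, and the student scores are hypothetical. With the last observed grade set to 0, the fitted intercept is the estimated status in that grade and the slope is the estimated growth per grade.

```python
def recenter(grades, last_observed):
    """Shift the grade metric so that the last observed grade equals 0."""
    return [round(g - last_observed, 1) for g in grades]


def ols_line(x, y):
    """Ordinary least squares intercept and slope for one student's scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Cohort C grade metric (fall = .1, spring = .8), centered on grade 7.8.
print(recenter([4.1, 4.8, 5.1, 5.8, 6.1, 6.8, 7.1, 7.8], 7.8))
# [-3.7, -3.0, -2.7, -2.0, -1.7, -1.0, -0.7, 0.0]

# Hypothetical Cohort A style student: spring scores in grades 2, 3, and 4.
time = recenter([2, 3, 4], 4)          # [-2, -1, 0]
scores = [188, 201, 208]
status, rate = ols_line(time, scores)  # intercept = status at grade 4; slope = growth rate

full_model_prediction = status + rate * 1     # estimated status plus one year of growth
slope_only_prediction = scores[-1] + rate     # observed last score plus growth rate
print(full_model_prediction, slope_only_prediction)  # 219.0 218.0
```

The two predictions differ only in the starting point: the full model starts from the estimated status in the last observed grade, whereas the slope-only model starts from the observed score.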

It should also be noted that the Cohort C data were analyzed using a linear and a non-linear (quadratic) 
model in order to evaluate best model fit. These analyses supported the use of a linear model over a 
non-linear model for both Reading and Mathematics using spring only and fall and spring data. 

Linear Model (LM) empirical Bayes slope estimate (LMeBSlp): 

$$\hat{Y}_{g+1} = RIT_{gi} + \pi_{1i.EB}$$

Where $RIT_{gi}$ is the observed RIT score for student $i$ in grade $g$, the final observed grade; $\pi_{1i.EB}$ is 
the LM empirical Bayes estimate of growth rate for student $i$ over the data collection period. 

Linear Model (LM) least squares status estimate with RIT block growth (LMlsSt+MRBG): 

$$\hat{Y}_{g+1} = \pi_{0i.LS} + \mu_{RB.g}$$

Where $\pi_{0i.LS}$ is the LM least squares estimate of the status for student $i$ in grade $g$, the final 
observed grade; $\mu_{RB.g}$ is the mean growth of students in the norms study going from grade $g$ to 
$g+1$ whose achievement in grade $g$ was in RIT block $RB$. 

Linear Model (LM) empirical Bayes status estimate with RIT block growth 
(LMeBSt+MRBG): 

$$\hat{Y}_{g+1} = \pi_{0i.EB} + \mu_{RB.g}$$

Where $\pi_{0i.EB}$ is the LM empirical Bayes estimate of the status for student $i$ in grade $g$, the final 
observed grade; $\mu_{RB.g}$ is the mean growth of students in the norms study going from grade $g$ 
to $g+1$ whose achievement in grade $g$ was in RIT block $RB$. 

Full Linear Model (LM) least squares status and growth rate estimates (FLMls): 

$$\hat{Y}_{g+1} = \pi_{0i.LS} + \pi_{1i.LS}\, g_{ti} + e_{ti}$$

Where $\pi_{0i.LS}$ is the LM least squares estimate of the status for student $i$ when the grade metric, 
$g_{ti}$, equals 0; $\pi_{1i.LS}$ is the LM least squares estimate of the growth rate for student $i$ over the data 
collection period; $e_{ti}$ is error. The final observed grade, $g$, was set to $g = 0$, and all prior grades 
were reset according to $g-1 = -1$, $g-2 = -2$, and so on. 

Full Linear Model (LM) empirical Bayes status and growth rate estimates (FLMeB): 

$$\hat{Y}_{g+1} = \pi_{0i.EB} + \pi_{1i.EB}\, g_{ti} + e_{ti}$$

Where $\pi_{0i.EB}$ is the LM empirical Bayes estimate of the status for student $i$ when the grade metric, 
$g_{ti}$, equals 0; $\pi_{1i.EB}$ is the LM empirical Bayes estimate of the growth rate for student $i$ over the data 
collection period; $e_{ti}$ is error. The final observed grade, $g$, was set to $g = 0$, and all prior grades 
were reset according to $g-1 = -1$, $g-2 = -2$, and so on. 



Mean of norms-based z scores (MnbZ): 

$$z_{g+1} = \frac{z_{g-n} + \dots + z_{g-3} + z_{g-2} + z_{g-1} + z_{g}}{n}, \qquad \hat{Y}_{g+1} = z_{g+1} \cdot \sigma_{g+1} + \mu_{g+1}$$

Where $z$ for the predicted grade, $g+1$, is the mean of norm-based z’s from all prior tests using the 
respective means and standard deviations, as found in the norms study, from the earliest grade, 
$g-n$, to the final observed grade, $g$, and $\sigma_{g+1}$ and $\mu_{g+1}$ are the standard deviation and the mean, 
respectively, of grade $g+1$ from the norms study. 



Mean of norms-based z scores with last observed score double weighted (MnbZ*): 

$$z_{g+1} = \frac{z_{g-n} + \dots + z_{g-3} + z_{g-2} + z_{g-1} + 2z_{g}}{n+1}, \qquad \hat{Y}_{g+1} = z_{g+1} \cdot \sigma_{g+1} + \mu_{g+1}$$

Where $z$ for the predicted grade, $g+1$, is the mean of norm-based z’s from all prior tests using the 
respective means and standard deviations, as found in the norms study, from the earliest grade, 
$g-n$, to the final observed grade, $g$, which is double-weighted, and $\sigma_{g+1}$ and $\mu_{g+1}$ are the standard 
deviation and the mean, respectively, of grade $g+1$ from the norms study. 
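
The mean z-score models can be sketched as follows. The reference means and standard deviations below are hypothetical placeholders rather than values from the norms study, the function name is an assumption, and the divisor is taken to be the number of z scores averaged (with the last score counted twice in the double-weighted variant). The same function would serve the locally based variants if local means and standard deviations were supplied instead.

```python
def predict_mean_z(scores, means, sds, mean_next, sd_next, double_last=False):
    """Predict the next-grade RIT from the mean of a student's prior z scores.

    scores, means, sds: the student's prior RIT scores and the reference mean
    and standard deviation for each corresponding grade/term.
    mean_next, sd_next: reference mean and standard deviation for the predicted grade.
    """
    z = [(s - m) / sd for s, m, sd in zip(scores, means, sds)]
    if double_last:
        z.append(z[-1])  # count the last observed score twice
    z_bar = sum(z) / len(z)
    return z_bar * sd_next + mean_next


# Hypothetical reference values for three prior grades and the predicted grade.
means, sds = [190, 200, 208], [15, 14, 13]
student = [205, 207, 208]  # z scores: 1.0, 0.5, 0.0

print(predict_mean_z(student, means, sds, mean_next=214, sd_next=12))
# 220.0
print(predict_mean_z(student, means, sds, mean_next=214, sd_next=12, double_last=True))
# 218.5
```

Note that no growth term appears anywhere: the prediction assumes the student holds the same average relative standing in the predicted grade.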



Mean of locally based z scores (MlbZ): 

$$z_{g+1} = \frac{z_{g-n} + \dots + z_{g-3} + z_{g-2} + z_{g-1} + z_{g}}{n}, \qquad \hat{Y}_{g+1} = z_{g+1} \cdot sd_{g+1} + \bar{X}_{g+1}$$

Where $z$ for the predicted grade, $g+1$, is the mean of locally based z’s from all prior tests using the 
means and standard deviations calculated from scores in the earliest grade, $g-n$, to the final 
observed grade, $g$, and $sd_{g+1}$ and $\bar{X}_{g+1}$ are the local historical standard deviation and the mean, 
respectively, of grade $g+1$. 

Mean of locally based z scores with last observed score double weighted (MlbZ*): 

$$z_{g+1} = \frac{z_{g-n} + \dots + z_{g-3} + z_{g-2} + z_{g-1} + 2z_{g}}{n+1}, \qquad \hat{Y}_{g+1} = z_{g+1} \cdot sd_{g+1} + \bar{X}_{g+1}$$

Where $z$ for the predicted grade, $g+1$, is the mean of locally based z’s from all prior tests using the 
means and standard deviations calculated from scores in the earliest grade, $g-n$, to the final 
observed grade, $g$, which is double-weighted, and $sd_{g+1}$ and $\bar{X}_{g+1}$ are the local historical 
standard deviation and the mean, respectively, of grade $g+1$. 

All models except for the last two were applied to the cases in all three cohorts. All data from each 
set were used, with the last score used in prediction (referred to above as the last observed grade, g) 
being the score from the spring one year prior to the spring score being predicted (i.e., grade g). 
This means that for Cohort C, where the RIT being predicted was for grade 8, the fall grade 8 RIT was 
not used in any of the prediction models. The last two models, locally-based z scores, could only be 
applied to the Cohort A data for two reasons: a) data for Cohort C were collected across districts, thus 
common local means and standard deviations were not available, and b) no historical local data were 
available to supply the means and standard deviations for the predicted grade for Cohort B, grade 6. 



Analysis. 

Residuals at the individual student level ($Y_{g+1} - \hat{Y}_{g+1}$) yielded from each of the models were the focus 
of analysis. For each set of predictions from each cohort, several statistics were computed to help 
describe the resulting distribution of residuals. These included the mean residual, the root mean 
square error, and the percent of the cases for each model that yielded the minimum residual across all 
models. To assess each model’s uniformity of prediction across the measurement range, 
Pearson product-moment correlations were calculated between the residuals and the last observed RIT 
score. Positive correlations indicate that higher scores will tend to be under-predicted and lower 
scores will tend to be over-predicted. Negative correlations indicate the opposite tendencies. The 
extent of these deviations depends on the magnitude of the correlation. In addition, the percent of 
cases for each model that yielded a predicted score within a reasonable standard error band of the 
observed score was calculated. ‘Reasonable’, here, was considered to be ±3.3 for Reading and ±3.2 
for Mathematics. These values were based on examinations of the error levels observed for well 
targeted tests - raw score 45-65 percent correct. Comparisons between methods were also maintained 
at the descriptive level. More specifically, plots of residuals by the final (observed) score were 
developed to form a more complete understanding of the nature of prediction results of the various 
models. 
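
The descriptive statistics named above can be computed along the following lines. This is a sketch under stated assumptions: the data are hypothetical, the function and key names are illustrative, and the statistic for the percent of cases yielding the minimum residual across models is omitted because it requires the predictions of all models side by side.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summarize(observed, predicted, last_rit, band):
    """Summarize one model's residuals (observed score minus predicted score)."""
    resid = [o - p for o, p in zip(observed, predicted)]
    n = len(resid)
    return {
        "mean_residual": sum(resid) / n,  # negative values indicate over-prediction
        "rmse": math.sqrt(sum(r * r for r in resid) / n),
        "r_resid_last_rit": pearson(resid, last_rit),
        "pct_within_band": 100 * sum(abs(r) <= band for r in resid) / n,
    }


# Hypothetical Reading data for four students; band = 3.3 RITs as in the text.
stats = summarize(
    observed=[210, 220, 230, 240],
    predicted=[212, 221, 229, 236],
    last_rit=[200, 212, 222, 233],
    band=3.3,
)
print(stats["mean_residual"])    # 0.5
print(stats["pct_within_band"])  # 75.0
```

In this toy example the residuals rise with the last observed score, so the correlation is positive, which corresponds to the under-prediction of higher scores described above.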






Results 



Cohorts A and B. 

Table 5 contains the results of the five basic descriptive statistics for Cohorts A and B for both 
Reading and Mathematics. The asterisks in Table 5 designate the most favorable value for the 
particular descriptive statistic across all models. Similarly, the superscript italic 2’s designate the next 
most favorable value for the statistic. For example, in the Cohort A - Reading results, the MlbZ* 
model was found to have the most favorable mean residual (minimum absolute) value (.18), while the 
MRBG model was the next most favorable (-.18). 

When examining Table 5 within each content area, several commonalities appear. Initially we see that 
the linear models that included the slope parameter (LMlsSlp, LMeBSlp, FLMls, and FLMeB) in the 
prediction, tended to result in an over-prediction bias indicated by large negative values of the mean 
residual. This unfavorable outcome was evident in each of the other indicators. Models involving 
mean RIT block growth (MRBG, LMlsSt+MRBG, and LMeBSt+MRBG) resulted in somewhat more 
favorable results across indicators for Cohort A in both Reading and Mathematics. In fact the linear 
model using empirical Bayes estimates of status with RIT block growth as estimates of rate 
(LMeBSt+MRBG) produced the most favorable results in Reading. In Mathematics, however, the 
model using local-based z scores (MlbZ and MlbZ*) produced the most favorable set of results even 
though results generally under-predicted performance. 

For Cohort B, linear models that included the estimation of grade status from a linear model in 
combination with RIT block growth means as estimates of rate of growth, LMlsSt+MRBG (for 
Reading) and LMeBSt+MRBG (for Mathematics) yielded the most favorable set of indicators. In 
terms of percentage of predictions within the ±1 SEM bands established, the norm-based z-score models 
(MnbZ and MnbZ*) were both favorable for Reading. In Mathematics, the simple models using only 
mean grade level growth (MGLG) and RIT block growth (MRBG) were also quite favorable. In both 
cases, however, the correlations between the residual and the last observed RIT score were too high 
for these models to be considered across the measurement range. 




Table 5. Achievement Status Residuals by Method - Cohorts A and B 

Cohort C. 

Cohort C results are contained in Table 6. The upper part of the table presents residuals based only on 
spring data while the lower part presents residuals based on both fall and spring data. For both 
Reading and Mathematics, the additional fall data had only minimal effect on bringing the mean 
residual closer to zero. For Reading, the inclusion of fall data in the two full linear models (FLMls 
and FLMeB) actually introduced more bias into the predictions. However, the large over-prediction 
levels associated with linear models involving a slope parameter that were noted in the 
Cohorts A and B data were not as pronounced for Reading and were virtually absent for 
Mathematics. 

Variance (RMSE) in the residuals of the models using linear estimates (LMlsSlp, LMeBSlp, 
LMlsSt+MRBG, LMeBSt+MRBG, FLMls, & FLMeB) was, in general, more favorable when both fall 
and spring data were used. This was the case for both Reading and Mathematics. Predictions in 
Reading using fall and spring and spring only data had the least variance when the full empirical 
Bayes linear model (FLMeB) and the linear model using empirical Bayes estimates of end grade status 
and RIT block mean growth for the rate estimate (LMeBSt+MRBG) were used. In Mathematics, the linear 
model estimating end grade status using ordinary least squares and RIT Block mean for the rate 
estimate (LMlsSt+MRBG) and the simple observed end grade status plus mean RIT block growth 
(MRBG) resulted in the lowest levels of residual variance. 

The linear models using empirical Bayes estimates of end grade status resulted in the most accurate 
(i.e., the highest percentage of cases within ±1 SEM) predictions in Reading when fall and spring data 
were used. For the spring only data, the full linear model using empirical Bayes estimates (FLMeB) 
resulted in the most desirable statistics overall. For the fall and spring data, the model using empirical 
Bayes and the model using ordinary least squares estimates of end grade status plus mean RIT block 
growth can be seen as the most favorable, leading to the most accurate predictions overall. 

Prediction accuracy in Cohort C Mathematics was highest for the simple observed end grade status 
plus mean RIT block growth model (MRBG), which was found to be the most effective overall, even though 
some of its indicator statistics were not optimal. This was the case for both the spring only and the fall 
and spring datasets. However, the two linear models that used RIT block growth as the estimate of 
rate yielded accuracy percentages that approached that of the MRBG model in the fall and spring 
dataset. The norms-based z-score models yielded the least accurate predictions by far, particularly for 
the fall and spring dataset. 




Table 6. Achievement Status Residuals by Method - Cohort C 

Residual plots. 

Figures 1 through 4 present selected residual plots for Reading and Mathematics from the previous 
four sets of analyses (Cohorts A, B, C spring only, and C fall and spring). Each plot shows the 
resulting residuals from the selected model in relation to the final (observed) RIT scores. The plots 
selected for presentation were for the most parsimonious model in each set. For contrast and for 
illustrative purposes, the least parsimonious models for the same analysis set are presented in the 
lower portion of each figure. For purposes here, ‘most parsimonious’ refers to the model that resulted 
in the most favorable combination of low bias, low RMSE, low $r_{resid \cdot RIT_{gi}}$ (the correlation between the residuals and the last observed RIT score), and high percent of 
predictions within ±1 SEM. 

The plots require little explanation but would benefit from pointing out a few characteristics of what 
we would expect to see in a parsimonious model. These include: 

1. A trend line that runs through the range of the plot at or very close to the zero level. 
This is illustrated well in the Figure 2, Mathematics, most parsimonious plot. 

2. When there is a positive or negative trend in the residuals, the difference between the 
most positive and most negative would be contained in a very narrow band. Figure 1, 
Reading, most parsimonious illustrates this. 

3. Vertical scatter around the zero point would be compact, with the vast majority of 
residuals falling inside a narrow range (e.g., ±10). Figure 2, for Mathematics, most 
parsimonious is the best example of this among the data sets. 

4. Scatter around the zero point trend line would be vertically symmetrical across the 
entire measurement range of the RIT scores. None of the figures represents this 
particularly well, but Figure 2 for Mathematics, most parsimonious comes closest. Lack 
of symmetry is an indication that the model is differentially accurate across the measurement 
scale. 

Cohort A. The linear model with empirical Bayes estimates of end grade score plus RIT block growth 
was chosen as the most parsimonious for Reading. The plot for Reading (Figure 1) shows better 
predictions for scores above 200. More serious over-predictions (i.e., residuals < -10) were evident. 
For Mathematics, the linear model using ordinary least square estimates of grade status plus RIT block 
growth was selected. Again, the most discrepant residuals appeared at about RIT 225 and below. 

Cohort B. Predicting Reading using linear least squares to estimate end grade and RIT block mean to 
estimate rate was selected as the most parsimonious model (see Figure 2). This model resulted in a 
very slight bias toward under-prediction. Discrepant over-predictions (< -10) were distracting but 
relatively infrequent. The empirical Bayes version of the same model was selected as the most 
parsimonious for Mathematics. Its pattern of residuals was fairly symmetric around zero and 
generally clustered within the -10 to +10 RIT range. 

Cohort C, spring only data. The full linear model using empirical Bayes estimates was selected as the 
most parsimonious model for the Reading predictions. Even though this model resulted in a slight 
over-prediction bias (mean residual = -.31), its more severe under-predictions (residuals >10) were 
more common across the entire measurement range. This was similar to the most parsimonious model 
selected for Mathematics, the simple observed end grade plus RIT block growth model. Its more 
severe under-predictions occurred for scores in the 185-265 RIT range while its more severe over- 
predictions occurred in the 200-280 RIT range. 

Cohort C fall and spring data. The most parsimonious model for Reading was considered to be the 
linear model using ordinary least squares estimates for end grade status plus RIT block growth for a 
rate of growth estimate. The vast majority of its predictions fell within a 20 point band around zero. 
However, the severe over-predictions occurred for RIT scores in the 175-245 range, while the severe 
under-predictions were in the 190-255 RIT range of last observed scores. The model selected as the most 
parsimonious for Mathematics was the same as the one selected for the Cohort C, spring only data set. 
The comments made there apply to the fall and spring data set. 

Figure 1. Residual plots of the most and least parsimonious models for Reading and Mathematics for Cohort A 
Panels: Reading, Mathematics 

Figure 2. Residual plots of the most and least parsimonious models for Reading and Mathematics for Cohort B 
Panels: Reading, Mathematics 

Figure 3. Residual plots of the most and least parsimonious models for Reading and Mathematics for Cohort C, Spring Data Only 
Panels: Reading, Mathematics 

Figure 4. Residual plots of the most and least parsimonious models for Reading and Mathematics for Cohort C, Fall & Spring Data 
Panels: Reading, Mathematics 

Discussion 



This study was undertaken to evaluate models that could be used to set single-year individual student 
academic growth targets. Multiple terms of individual student reading and mathematics test results 
were analyzed to predict each student’s final status score in each subject. Test records from over 5300 
students in three cohorts were used; two cohorts of roughly 670 to 750 students and one cohort of 
roughly 4000 students. The two smaller cohorts were from the same school district; the larger one was 
from the 2002 NWEA Norming Study and represented nine school districts. Three terms of spring 
data were used to predict scores in a fourth spring term for the two smaller cohorts. For the larger 
cohort, four terms of fall data and four terms of spring data were used to predict scores in a fifth spring 
term. Also, the four terms of spring data were used independently to predict scores in the fifth spring 
term. 

The twelve models used to make predictions varied in the: a) treatment of data prior to the last 
‘observed’ score, b) nature of the last score [observed or estimated], and c) estimate of rate of growth 
used [linear, RIT block growth, ignored in the z-score models]. Of the 12 models applied to each of 
the eight data sets, five emerged as yielding the most parsimonious set of predictions. The predictions 
within ±1 SEM of the observed scores ranged from roughly 40 to 50 percent for these models. 
Corresponding percentages for the six least parsimonious models ranged from roughly 11 to 37 
percent. 
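
The accuracy criterion used here, the percentage of predictions within ±1 SEM of the observed score, can be sketched as follows. This is only an illustration: it assumes each student has an individual standard error of measurement attached to the observed score, and the function name and all values are hypothetical rather than taken from the study.

```python
def percent_within_sem(observed, predicted, sems, k=1.0):
    """Percentage of predictions within k SEMs of the observed score."""
    hits = sum(
        abs(o - p) <= k * s
        for o, p, s in zip(observed, predicted, sems)
    )
    return 100.0 * hits / len(observed)


# Made-up values for illustration only; these are not data from the study.
observed = [205, 212, 198, 221]
predicted = [203, 217, 199, 214]
sems = [3.0, 3.2, 3.1, 3.0]  # hypothetical per-student SEMs in RIT points

pct = percent_within_sem(observed, predicted, sems)
print(pct)
```

Applying the same function to each model's predictions on a data set allows the models to be ranked on a common scale.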

The prediction task here was intentionally restricted to using only available achievement test data. 
Were a traditional modeling or forecasting approach taken, additional data such as school or district 
characteristics (e.g., class size, curricular differences), or student characteristics (e.g., gender, 
ethnicity, level of poverty, English language status) could have been added to help model additional 
variance. For example, recalling that Cohort C was made of data from nine school districts, it is quite 
possible that a good portion of the variability in the Mathematics data could have been attributable to 
differences in mathematics course-taking patterns among these districts. Taking such differences 
into account might have improved prediction accuracy. However, even though they may improve 
prediction accuracy, these variables would typically not be feasible to include. In all likelihood they 
would be viewed as setting differential growth targets (expectations) based on school and/or student 
characteristics; current collective thought cannot reconcile this practice with the demands of the 
standards movement. 

In what might be considered a prophetic announcement of the results of this study, George E.P. Box 
(as cited in Sloane & Gorard, 2003) once opined, “All models are wrong, but some are useful.” Even 
though the term ‘parsimonious’ has been used here to label particularly attractive sets of results for a 
model, the term could only be applied as a relative one. When the most parsimonious model 
accurately predicted (within 1 SEM) student status slightly less than 50 percent of the time, we can 
safely conclude that all these models are wrong, or at least less accurate than we would like. 
However, this does not preclude the possibility that some of the models or model components may 
prove useful under specific conditions. What proves useful may well depend on the characteristics of 
the data available to model. If a grade-independent scale can be assumed, the important characteristics 
for the models used here reduce to the quantity of data, the number of waves of data with common 
student test results, and variability in those data. 

A district that has only one or two waves of same-student data could, in the absence of stable growth 
norms, assign individual growth targets based on the grade level differences in status norms. This is 
consistent with current standards-based accountability systems; all students in a grade would be 
assigned the same growth targets. Considering the potential disruption this could cause, it is not a 
recommended approach. A more promising approach would be to gather one or two additional waves 
of data and then investigate one of the two local-based mean z-score models used with Cohort A. 

These models should work well when the number of students per grade level is about 500 or more and 
the score distributions are approximately normal. When grade level growth norms are available, these 
could be used immediately, though for individual student growth targets they are only a slight 
improvement over using grade level differences in status norms. At the individual student level, 
growth norms that are segmented based on initial score (e.g., RIT block mean growth) will typically 
result in more reasonable growth targets. 

When three or four waves of achievement data are available for making predictions, the range of 
options increases. Linear models that provide an estimate of the last (observed) score combined with a 
mean from segmented growth norms as a substitute for growth rate should be considered. Results 
from this study demonstrated that with short time series (e.g., 3 waves) the slope estimates of the 
linear models had low-moderate reliability (viz., .36 and .087 for Reading and Mathematics, 
respectively in Cohort A; and .12 and .31 for Reading and Mathematics, respectively in Cohort B). 

Use of the RIT block means in place of the rate estimates from the linear models improved accuracy 
over the full linear models. In addition to the more complex models, the local-based mean z-score 
model could be explored when the conditions noted previously hold. 
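
To make the distinction concrete, the sketch below contrasts a full linear prediction with one that keeps the ordinary least squares estimate of the last score but replaces the unreliable individual slope with a RIT block mean growth value. It is a simplified illustration under stated assumptions, not the study's implementation: the 10-point block width, the `BLOCK_GROWTH` lookup table, the function names, and the scores are all hypothetical.

```python
def ols_fit(terms, scores):
    """Ordinary least squares intercept and slope for one student's scores."""
    n = len(terms)
    t_bar = sum(terms) / n
    s_bar = sum(scores) / n
    sxx = sum((t - t_bar) ** 2 for t in terms)
    sxy = sum((t - t_bar) * (s - s_bar) for t, s in zip(terms, scores))
    slope = sxy / sxx
    return s_bar - slope * t_bar, slope


def predict_full_linear(terms, scores):
    """Extrapolate the fitted line one term past the last observed term."""
    intercept, slope = ols_fit(terms, scores)
    return intercept + slope * (terms[-1] + 1)


def predict_block_growth(terms, scores, block_growth, block_width=10):
    """Estimated last score plus the mean growth for its RIT block."""
    intercept, slope = ols_fit(terms, scores)
    end_status = intercept + slope * terms[-1]  # estimated, not observed
    block = int(end_status // block_width) * block_width
    return end_status + block_growth[block]


# Hypothetical mean growth by starting RIT block; not actual growth norms.
BLOCK_GROWTH = {190: 8.0, 200: 6.5, 210: 5.0}

# Made-up three-wave record for one student; not data from the study.
terms = [1, 2, 3]
scores = [196, 203, 207]

full_linear_prediction = predict_full_linear(terms, scores)
rit_block_prediction = predict_block_growth(terms, scores, BLOCK_GROWTH)
print(full_linear_prediction, rit_block_prediction)
```

With only three waves, the individual slope rests on very little information, which is why substituting a norm-based growth value for it can be attractive.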

With four or more waves of data, consideration can be given to the full linear models. However, the 
results here demonstrate that more waves of data do not always yield the least biased or most accurate 
predictions, even though they are likely to be among the most accurate. A pattern of inconsistent 
term-level variances for a cohort can be considered a sign that linear models, or at least the linear models 
used here, may not lead to the most accurate results (see Cohort C). 

Explicitly incorporating individual growth into a district or state level accountability system has the 
potential to expand the capability of the system by making it more comprehensive and more sensitive 
to the full range of academic change. To realize this potential, the expectations for academic change 
need to be generated from the perspective of the individual student. Research in this area is still 
immature and more research is clearly needed. However, even at this stage there is sufficient evidence 
to counter the unfortunate practice of declaring group growth targets in the absence of reasonable 
expectations for individual student growth, a practice that has been encouraged by status-oriented 
accountability systems. 